Generated 2026-02-14 22:31
Chatterbox VC by Resemble AI is a zero-shot voice conversion model.
It encodes the source audio into discrete S3 speech tokens (capturing content and prosody) and extracts a speaker embedding from a short reference clip. A flow-matching model then decodes a new waveform that sounds like the target speaker saying the source content.
No training or fine-tuning is needed, only a few seconds of reference audio.
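
The three stages described above (content tokens, speaker embedding, decode) can be pictured with the toy data-flow sketch below. It is not Chatterbox's implementation or API: every function name, the frame size, and the quantisation scheme are hypothetical stand-ins chosen only to show which input feeds which stage. The real model uses learned neural networks for all three steps.

```python
import math
from typing import List, Sequence

FRAME = 4  # samples per toy frame (hypothetical; real tokenizers use much longer frames)


def encode_content_tokens(source: Sequence[float], levels: int = 8) -> List[int]:
    """Stand-in for the S3 tokenizer: one discrete token per frame of source audio."""
    tokens = []
    for start in range(0, len(source) - FRAME + 1, FRAME):
        frame = source[start:start + FRAME]
        mean = sum(frame) / FRAME  # assumes samples normalised to [-1, 1]
        idx = int((mean + 1.0) / 2.0 * levels)  # map [-1, 1] onto 0..levels
        tokens.append(min(max(idx, 0), levels - 1))  # clamp to a valid token id
    return tokens


def extract_speaker_embedding(reference: Sequence[float]) -> List[float]:
    """Stand-in for the speaker encoder: a fixed-size summary of the reference clip."""
    n = len(reference)
    mean = sum(reference) / n
    energy = math.sqrt(sum(x * x for x in reference) / n)  # RMS level
    return [mean, energy]


def decode_waveform(tokens: Sequence[int], speaker: Sequence[float],
                    levels: int = 8) -> List[float]:
    """Stand-in for the flow-matching decoder.

    Content comes only from the tokens; the speaker embedding only shapes
    how that content is rendered (here, a simple gain).
    """
    energy = speaker[1]
    out: List[float] = []
    for tok in tokens:
        centre = (tok + 0.5) / levels * 2.0 - 1.0  # token id back to a value in [-1, 1]
        out.extend([centre * energy] * FRAME)
    return out


def convert(source: Sequence[float], reference: Sequence[float]) -> List[float]:
    """Zero-shot conversion: no training step, just two inputs and three stages."""
    tokens = encode_content_tokens(source)           # what is said
    speaker = extract_speaker_embedding(reference)   # who should say it
    return decode_waveform(tokens, speaker)          # render content in that voice


if __name__ == "__main__":
    source_audio = [0.5] * 8      # toy "source" clip
    reference_audio = [0.2] * 4   # toy "reference" clip
    print(len(convert(source_audio, reference_audio)))
```

The point of the sketch is the separation of concerns: the source clip only ever reaches the tokenizer, and the reference clip only ever reaches the speaker encoder, which is why a new target voice needs no retraining.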